Exact test for Markov order 



Shawn D. Pethel and Daniel W. Hahs* 
U.S. Army RDECOM, RDMR-WSS, Redstone Arsenal, 
Alabama 35898, USA; *Torch Technologies, Inc., Huntsville, AL 35802 
(Dated: February 7, 2013) 

We describe an exact test of the null hypothesis that a Markov chain is nth order versus the 
alternate hypothesis that it is (n+l)-th order. The procedure does not rely on asymptotic properties, 
but instead builds up the test statistic distribution via surrogate data and is valid for any sample 
size. Surrogate data are generated using a novel algorithm that guarantees, per shot, a uniform 
sampling from the set of sequences that exactly match the nth order properties of the observed 
data. 
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It often happens that it is useful to describe a process as a set of discrete states with probabilistic transitions. 
Examples abound in various fields such as the study of chemical processes pj, DNA sequences Q, finance Q, and 
nonlinear dynamics [Ij], among others. If a transition to a new state is conditioned only on the present state we 
call this model a Markov chain. An nth-order Markov chain is a generalization to include the past n states in the 
transition probability. When the conditional probabilities are not otherwise given, they are estimated from a time 
series of observations. 

If the order of the Markov chain is in question there are various tests and criteria available to narrow down the 
options. A classical approach is to formulate the question as a hypothesis test that a chain is n-th order versus 
(n + l)-th orderQ. When the test statistic has a known limiting distribution, such as x 2 , a p-value can be calculated 
and a decision made based on a chosen significance level @. Another avenue are the information criteria tests such 
as AIC and BIC These produce rankings over multiple orders based on expected likelihood and have built-in 

corrections for over-fitting. Both of these approaches rely on approximations that are only valid in the limit of large 
samples. In small sample situations one cannot be sure of their efficacy. 

It is possible to perform an exact hypothesis test that is valid for any sample size. Instead of relying on asymptotic 
properties, the test statistic distribution is discovered by generatin g s amples (referred to here as surrogates) that 
exactly match the n-th order properties of the observed time series [lfj. The challenge is in efficiently generating a 
large number of such samples, especially for higher orders. To our knowledge no solution to this problem has been 
reported in the literature. The contribution of this work is a surrogate data procedure that has ideal properties: 
one sample is generated per shot, samples are uniformly selected from the set of all possible surrogate sequences, 
computation time increases linearly with the length of the sequence, and any order can be accommodated. Armed 
with this new procedure it is now practical to perform exact hypothesis tests of Markov order. 

We first describe how to do hypothesis testing of Markov order using the \ 2 statistic, for which the distribution 
is known in the large sample limit. Next we describe the method of surrogate data generation based on Whittle's 
formula. Then we compare the x 2 statistic in large and small sample cases using both the asymptotic distribution 
and the exact distribution obtained from the surrogates. 

A sequence of observations {x\ . . .xn} form a Markov chain of order n if the conditional probability satisfies 

p(x t +l\xt,X t -l . . .) =p(x t +l\x t . ..Xt- n+ l), (I) 

for all n < t < N. For convenience we will label the states each measurement can take by positive integers up to M. 
A sequence of discrete measurements may come from a process that is naturally discrete, such as a DNA sequence, 
or from a continuous process that has been discretized by an analog-to-digital measuring device. Unless otherwise 
specified a Markov process is assumed to be first order (n = I). This means that the transition probabilities to a 
future state depend only on the present state and not on prior states. An nth order process can always be cast as 
first order by grouping the present state with the relevant past states into a word, in which case the number of states 
can be up to M n . A process that has no dependence on past or present (such as a random iid process) is said to be 
zeroth order. 

To perform a hypothesis test of n-th order versus (n + I)-th order one begins with an assumption of nth order and 
then computes the distribution of a suitable (n + l)-th order statistic. If the observed (n + I)-th order statistic is 
highly unlikely then n-th order assumption is rejected. The initial assumption is called the null hypothesis and the 
probability of the observed statistic given the distribution implied by the null hypothesis is referred to as the p- value. 
Typically, a p- value less than or equal to 0.05 is taken as grounds to reject the null hypothesis. 

Let us begin with the assumption that {xt} is first order (n = I) and calculate the p- value of a second order statistic 
using a x 2 distribution. The null hypothesis is 

p(x t +i = i\x t = j, x t -i = k) = p(x t+ i = i\x t = j), (2) 

or using Bayes' rule 

/ • n P(xt+i=i,xt = j)p(xt=j,x t -i=k) 

p{x t+ i=i,xt = 3,xt-i = k) = . (3) 

P[xt = 3) 

The l.h.s. of ([3]) multiplied by N — 2 is the expected number of times the word (xt+i =i,x t = j, x t -\ — k) appears 
in the data given the null hypothesis. The quantities on the r.h.s. are not expected values; they are taken from the 
observed sequence. Let E w be the expected word count where Yl E w = N — 2 and w indexes the set of all words for 
which the expected count is greater than zero. Similarly, let O w be the corresponding count from the observed data. 
If the sequence w' does not appear in the observed data, then O w i = 0. We can now define the expected x 2 statistic 
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which is a measure of the deviation of the observed count from the expected. The first order assumption does not 
uniquely determine the second order statistics; there is some freedom for \ 2 to vary from shot to shot even assuming 
the null hypothesis is true. The advantage of the \ 2 statistic is that, given the degrees of freedom d, the distribution 
/(% 2 ;d) is known in the limit N — > oo. The p- value is then obtained by integrating f(x 2 ) over \ 2 ^ X 2 X p- 

A detail not made clear in the literature is how to compute the (n + l)-th order degrees of freedom d needed to 
determine the x 2 distribution. Let F be a transition count matrix of size up to M n x M n . The («, j)-th entry of 
F is the number of times word i transitions to word j. Because words overlap and differ by only one observation, 
there are at most M nonzero entries in row i of F. In the case that all words are present, F can be rearranged in 
block diagonal form with n M x M blocks. In the case that some words are not present in the observed data, these 
blocks will be of differing size. If the size of the fcth block is ru x Ck, then the total number of degrees of freedom d is 
£(r*-l)(c fc -l). 

The hypothesis test as described above is not exact; it relies on the \ 2 distribution in the asymptotic limit of infinite 
data. To discover the exact distribution for finite data one needs to know all possible sequences that satisfy the null 
hypothesis along with their likelihood. For the first order hypothesis these sequences must all have exactly the same 
joint probabilities shown on the r.h.s. of ([3]). Let Fij be the {xt+i =i,xt =j) word count for the observed sequence 
and let S represent the set of sequences with the same F and the same beginning and end state as the observed 
sequence. The members of S all have the same Xexp as the observed sequence (but not necessarily the same second 
order statistics). 

The number of sequences that have the word count F and begin with state u and end with state v is given by 



Whittle's formula [TTjj : 



Nuv(F) = ^§\C W (5) 



where Fp is the sum of row i and C vu is the (v, u)-th cofactor of the matrix 

(6) 




The cofactor is computed by striking out the v-th row and M-th column and taking the determinant. 
As an example, consider the following sequence of twelve binary observations: 

x = {0 1 1 1 1 1 1 1}. (7) 

The sequence x has u = 0, v = 1 and transition count 

F=(lf). (8) 



From ^ we compute 



F* = ( Ji A (9) 
and Cio = det(4/5) = 4/5. Plugging into (|5|) gives 

5' • 6' 4 

"w-jnnrs-" 1 (10) 

The cardinality of the set S(x) is therefore 80. The transition count F determines the first order joint probabilities 
p(x n+ i, x n ) and, after fixing the first and the last symbol, the zeroth order probabilities p(x n ) as well. Therefore all 
80 sequences in S have first order transition probabilities p(x n+ i\x n ) identical to the observed sequence x. 

For all but the shortest sequences the value of (0 is so large that it cannot be computed using fixed precision 
arithmetic. In our algorithm we instead compute the natural logarithm of © using a Stirling's series for the factorial 
terms: 

Inz! - zlnz - z + - ln(27rz) + — 3_ H (11) 

2 v ' 12z 360z 3 1260z 5 1680z 7 v ' 



when z > 16. 
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FIG. 1. The number of times a sequence appears in 10 6 iterations of Whittle's algorithm. The sequences labeled 1 — 80 refer 
to the example in the text. 

To find the p- value we need to know the fraction of sequences have % 2 values less than or equal to Xexp- If \&\ i s 
too large to enumerate all the sequences, the p-value can still be estimated to any desired accuracy provided one has 
a method of producing uniform random samples from the set S. Previously reported methods for generating samples 
from S are impractical, especially for higher order testing 10]. Here we give an efficient procedure. 

We construct a member {y t } of S starting with y± = u, ending with y^ = v, and having the transition count matrix 
F. The candidates for the second element y^ are the set \y2\Fyxy2 > 0}. For each candidate we compute N V2V (F'), 
the number of sequences left, where F[j — Fij — S Viy2 . We choose a candidate randomly in proportion to the number 
of sequences left; a path that leads to a small number of possibilities is chosen less frequently than one that leads to 
a large number. Once yi is chosen F is set equal to the appropriate F' and the process is repeated for y% and so on 
until 2/7v_i is reached. 

Returning to our example case, we have y\ = 0, yn = 1, and yi = {0, 1}. The two choices for y2 lead to the 
following number of sequences: 

N Ql (I fj = 20, 

N u (l fj = 60. (12) 

Therefore yi = is chosen with 20/80 =1/4 probability and yi = 1 with 3/4 probability. By weighting our choice at 
each step by Whittle's formula we guarantee paths that do not result in a valid sequence are not followed and that 
all valid paths are followed with uniform probability (FigO}. This method is suitable for the generation of very long 
surrogates, as the difficulty increases only linearly with N. For producing sequences of order n > 1 simply replace 
the elements of yt with length n words. The matrix F can be as large as M n x M n , but has no more than N — n 
nonzero elements and can be handled efficiently using sparse methods. Code is available for generating surrogates by 
this method 

Figure $2$ shows the % 2 density computed in the asymptotic limit with the density estimated from 20000 surrogates 
of a random Markov process of 4 states. The degrees of freedom are calculated for 2nd order assuming fixed 1st order 
statistics. In the top panel (N — 2500) there is close agreement indicating that that the surrogate data statistics 
behave as expected in the asymptotic limit. The bottom panel uses the same time series, but only the first 200 data 
points. The significant disagreement between the two densities illustrates the need for an exact test when the sample 
is small. 

The efficacy of a hypothesis test is quantified by its size and power. The size of a test is its probability of incorrectly 
rejecting the null hypothesis (Type I error). For an ideal test the size should be equal to the significance level (0.05). 
The power of a test is its probability of correctly rejecting the null hypothesis. The failure to do so is a Type II 
error. To estimate power we use data from Markov processes that are one order higher than the null hypothesis; other 
choices could yield different results. 

Test cases are taken from a set of randomly generated nth order Markov processes with four states (M = 4). Recall 
that such a process is specified by the transition probabilities p(xt+\ \x t . . . Xt- n -\-x), which when expressed as a matrix 
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FIG. 2. The asymptotic x 2 probability density (no marker) and the actual density estimated using surrogate data (circles). 
Data is taken from a randomly generated Markov process, degrees of freedom d are computed for 2nd order statistics given a 
1st order null hypothesis. Top panel (a) utilizes 2500 data points whereas the bottom panel (b) only 200. For the short time 
series (b) the asymptotic distribution differs considerably from the actual. 



is size M n x M, One way to create such a transition matrix is to populate it with [0, 1] random numbers and normalize 
the rows. We have found, however that this procedure tends to produce weakly nth order processes, particularly when 
either or both M and n are large. To produce strongly nth order processes we first scale the random numbers by 
adding one, raising them to the 10th power, and then row normalizing. This creates more variance in the transition 
probabilities. 

We generated 2500 trials for each size and power estimate shown in the following tables. For each case we tabulate 
results using the asymptotic distribution (labeled x 2 ) ano * the exact \ 2 distribution (labeled Xsurg) obtained using 
2500 surrogates. In addition we show H SUIg , which relies on the same surrogate data, but instead of the \ 2 statistic 
the nth order entropy rate is used: 

H(x t +l\x t ■ ..Xt- n+ l) = H(x t+ i,X t ■ ..Xt-n+l) - H(x t ■ ..X t -n+\). (13) 

As the surrogates have identical nth order block entropies, only H(xt+i,xt ■ ■ .Xt— n +i) needs to be re-computed for 
each trial. The use of this statistic was suggested in pj| ■ 



1st Order 

Size ±0.01 Power ±0.01 



Data 


x 2 


Xsurg 




x 2 


Xsurg 




25 


0.04 


0.04 


0.03 


0.48 


0.49 


0.49 


50 


0.05 


0.05 


0.04 


0.89 


0.89 


0.91 


100 


0.06 


0.04 


0.05 


0.98 


0.98 


0.98 


200 


0.08 


0.05 


0.05 


1.00 


1.00 


1.00 


400 


0.09 


0.05 


0.05 


1.00 


1.00 


1.00 



TABLE I. Estimated size of asymptotic and exact % 2 statistic for random 1st order Markov processes with 4 symbols, 2500 
trials. 

We break out each order in a separate table and list size and power versus data length. In the large sample limit 
both the exact and asymptotic methods should approach a power of 1 and a size equal to the significance level (0.05). 
The exact test is quite efficient; as little as 100 data points are needed for 1st and 2nd order tests, 200 for 3rd order, 
and 400 for 4th order. The asymptotic method is very slow to attain the ideal size even for the 1st order test (10 6 
sample size, not shown). For higher order tests we do not recommend use of the \ 2 distribution. There is no detectable 
difference between using the entropy rate as a test statistic and x 2 - As entropy rate is a simpler quantity to calculate, 
we recommend its use over \ 2 - 

Compared to the asymptotic x 2 test, the exact test is much slower; each step of Whittle's algorithm requires the 
computation of the determinate of the transition matrix. Considering that the transition matrix changes in only one 
entry at each step there is potential to improve the efficiency over our naive implementation. Even without such 



6 



2nd Order 



Data 




Size ±0.01 






Power ±0.01 


x 2 


Xsurg 




x 2 


Xsurg 




50 


0.07 


0.05 


0.04 


0.79 


0.59 


0.56 


100 


0.10 


0.05 


0.05 


0.98 


0.96 


0.98 


200 


0.10 


0.05 


0.05 


1.00 


1.00 


1.00 


400 


0.11 


0.05 


0.06 


1.00 


1.00 


1.00 



TABLE II. Estimated size of asymptotic and exact \ 2 statistic for random 2nd order Markov processes with 4 symbols, 2500 
trials. 



3rd Order 

Size ±0.01 Power ±0.01 



D&ta, X Xsurg ^surg X Xsurg ^surg 



50 


0.07 


0.01 


0.01 


0.44 


0.04 


0.03 


100 


0.18 


0.05 


0.05 


0.97 


0.56 


0.52 


200 


0.22 


0.05 


0.05 


1.00 


0.99 


0.99 


400 


0.22 


0.05 


0.05 


1.00 


1.00 


1.00 



TABLE III. Estimated size of asymptotic and exact \ 2 statistic for random 3rd order Markov processes with 4 symbols, 2500 
trials. 



optimizations, it is well within a desktop computer's ability to generate many thousands of surrogate sequences in 
minutes. Because each surrogate is generated independently, parallelization is straightforward. For our tables, each 
case involving 2500 trials, we opted to use 2500 surrogates, requiring the generation of 6.25 million surrogates per 
table entry. The standard error of our p- value estimates as well as our size and power estimates is then l/yi x 2500 
or ±0.01. If one doesn't need to analyze so many data sets, we recommend using 10,000 or more surrogates. 

In summary, we have described an exact test of the null hypothesis that a Markov chain is nth order versus the 
alternate hypothesis that it is (n ± l)-th order. At the heart of the test is an algorithm based on Whittle's formula, 
which efficiently produces surrogate data sets that have identical word transition counts as the observed sequence. 
Whittle's algorithm together with the entropy rate statistic make for a conceptually simple approach to Markov order 
hypothesis testing; no calculation of degrees of freedom or corrections for small sample size are necessary. 
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4th Order 



Data 




Size ±0.01 






Power ±0.01 


x 2 


Xsurg 




x 2 


Xsurg 


Hsurg 


50 


0.01 


0.00 


0.00 


0.02 


0.00 


0.00 


100 


0.12 


0.01 


0.01 


0.61 


0.03 


0.02 


200 


0.49 


0.04 


0.03 


1.00 


0.44 


0.42 


400 


0.74 


0.05 


0.05 


1.00 


0.99 


0.99 


800 


0.77 


0.05 


0.05 


1.00 


1.00 


1.00 



TABLE IV. Estimated size of asymptotic and exact \ 2 statistic for random 4th order Markov processes with 4 symbols, 2500 
trials. 



